Overview

**Sparkify** is a music platform with a customer churn problem. Churn means cancellation of the subscription.

In this project we predict which subscribers are likely to churn. If **Sparkify** can detect subscribers who are about to churn, it can target them with actions such as discounts, ad-free listening, and so on.

The data provided by **Sparkify** is a JSON log of user interaction events.

First, we analyzed the dataset and engineered features describing subscriber behavior. After the feature engineering step we moved on to modeling and tried several ML algorithms to predict potential churners. For evaluation we used the F1 score, because the dataset is imbalanced and accuracy is not a good metric for this problem.

Finally, we achieved an F1 score of 0.8 with a cross-validated Random Forest classifier.

Load and Clean Dataset

In this workspace, the dataset is the file medium_sparkify_event_data.json, provided by Sparkify.

Import Libraries and Setup Environment

Exploratory Data Analysis

In this part I analyze the raw data and show the details of the dataset.

The dataset includes 543K rows, 18 columns, and 449 distinct customers.

As you can see in the null-values table, there are some patterns.

User-level information

These columns contain data about users: **their names, gender, location, registration date, browser, and account level (paid or free)**.

There are 15,700 rows with an empty string in the **userId** column, which means we don't have any information about these users.

Log-specific information

Log-specific information shows how a particular user interacts with the service.

Song-level information

Information related to the song that is currently playing.

Only the **NextSong** page has song information.

Data Visualization

We'll start by exploring the behaviors of users who stayed and users who left.

Feature Engineering

Based on the analysis above, we need the following features:

Target:

- Churn (whether the user canceled the subscription)

Features:

- Total songs listened
- Number of thumbs down
- Number of thumbs up
- Lifetime in days
- Average songs played per session
- Number of songs added to playlist
- Total number of friends
- Help page visits
- Settings page visits
- Errors
- Downgrades
- OS
- Location

Correlation Matrix

VectorAssembler

StandardScaler

Modeling

Baseline Model

The baseline model predicts 0 for every user, meaning **no user has canceled the subscription**. We compare **Logistic Regression, Random Forest, Gradient Boosted Trees, and Support Vector Machine** against this baseline; any model that beats the baseline score is an improvement.

Logistic Regression

Random Forest

Gradient Boosted Trees

Support Vector Machine

Evaluate models

Conclusion

In this project, I learned how to deal with big data using PySpark. This dataset let me practice customized data visualizations and churn analysis by creating predictive features. I trained four models: Gradient Boosted Trees, Random Forest, Logistic Regression, and Support Vector Machine. **Random Forest** proved to be the best model among them. The main challenge for me was feature engineering, especially feature selection. In the future, I plan to group the location variable to reduce dimensionality and then perform a chi-squared test on it. More advanced ML algorithms could also be applied to this dataset.